Data Leakage via Access Patterns of Sparse Features in Deep Learning-based Recommendation Systems
Online personalized recommendation services are generally hosted in the cloud, where users query the cloud-based model to receive recommended items such as merchandise of interest or news feeds. State-of-the-art recommendation models rely on sparse and dense features to represent users' profile information and the items they interact with. Although sparse features account for 99% of the total model size, little attention has been paid to the potential information leakage through them. These sparse features are employed to track users' behavior, e.g., their click history and object interactions, and thus potentially carry each user's private information. Sparse features are represented as learned embedding vectors stored in large tables, and personalized recommendation is performed by using a specific user's sparse feature to index into these tables. Even with recently proposed methods that hide the computation happening in the cloud, an attacker in the cloud may still be able to track the access patterns to the embedding tables. This paper explores the private information that may be learned by tracking a recommendation model's sparse feature access patterns. We first characterize the types of attacks that can be carried out on sparse features in recommendation models in an untrusted cloud, and then demonstrate how each of these attacks leads to extracting users' private information or tracking users by their behavior over time.
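A minimal sketch of the lookup pattern the abstract describes, with table sizes, feature names, and item IDs purely illustrative: each sparse feature is a set of indices into an embedding table, so even if the embedding values themselves are hidden, the sequence of accessed indices reveals which items a user interacted with.

    import numpy as np

    # Hypothetical embedding table: one row per item ID (sizes are illustrative).
    NUM_ITEMS, EMBED_DIM = 10_000, 32
    embedding_table = np.random.randn(NUM_ITEMS, EMBED_DIM).astype(np.float32)

    def lookup(sparse_feature_ids):
        """Gather embedding rows for a user's sparse feature (e.g. click history)."""
        return embedding_table[sparse_feature_ids]   # rows indexed by item IDs

    # What an observer of memory accesses could log, even without seeing any values:
    access_log = []
    def observed_lookup(sparse_feature_ids):
        access_log.append(list(sparse_feature_ids))  # the access pattern itself leaks
        return lookup(sparse_feature_ids)

    user_click_history = [42, 977, 42, 5123]          # hypothetical user behavior
    _ = observed_lookup(user_click_history)
    print(access_log)  # [[42, 977, 42, 5123]] -- item IDs leak from indices alone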
Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
Mixture-of-Experts (MoE) models have gained popularity for achieving state-of-the-art performance on a wide range of tasks in computer vision and natural language processing. They effectively expand model capacity while incurring only a minimal increase in computation cost during training. However, deploying such models for inference is difficult due to their large size and complex communication patterns. In this work, we characterize two MoE workloads, namely Language Modeling (LM) and Machine Translation (MT), and identify their sources of inefficiency at deployment. We propose three optimization techniques to mitigate these inefficiencies, namely (1) dynamic gating, (2) expert buffering, and (3) expert load balancing. We show that dynamic gating improves maximum throughput by 6.21-11.23x for LM, 5.75-10.98x for the MT encoder, and 2.58-5.71x for the MT decoder. It also reduces memory usage by up to 1.36x for LM and up to 1.1x for MT. We further propose Expert Buffering, a new caching mechanism that keeps only hot, active experts in GPU memory while buffering the rest in CPU memory; this reduces static memory allocation by up to 1.47x. Finally, we propose a load balancing methodology that provides additional scalability to the workload.
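A minimal sketch of the Expert Buffering idea as described above, assuming PyTorch; the expert modules, capacity, and eviction policy here are hypothetical stand-ins, not the paper's implementation. Only the most recently used experts occupy device memory, while the rest stay parked in CPU memory and are copied in on demand.

    from collections import OrderedDict
    import torch, torch.nn as nn

    device = "cuda" if torch.cuda.is_available() else "cpu"   # falls back to CPU-only

    NUM_EXPERTS, HIDDEN, GPU_SLOTS = 8, 64, 2                  # illustrative sizes
    experts = [nn.Linear(HIDDEN, HIDDEN) for _ in range(NUM_EXPERTS)]  # parked in CPU memory

    resident = OrderedDict()                                   # expert_id -> module on the device

    def get_expert(expert_id):
        """Fetch an expert, keeping only GPU_SLOTS experts on the device (LRU eviction)."""
        if expert_id in resident:
            resident.move_to_end(expert_id)                    # mark as recently used
            return resident[expert_id]
        if len(resident) >= GPU_SLOTS:
            _, victim = resident.popitem(last=False)           # evict least-recently-used expert
            victim.to("cpu")
        resident[expert_id] = experts[expert_id].to(device)
        return resident[expert_id]

    # Tokens routed by a (hypothetical) gate: only touched experts occupy device memory.
    x = torch.randn(4, HIDDEN).to(device)
    for eid in [3, 3, 5, 1]:
        x = get_expert(eid)(x)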
Thermal-aware 3D Microarchitectural Floorplanning
Next-generation deep-submicron processor design will need to take many performance-limiting factors into consideration. Flip-flops are inserted to prevent global wire delay from growing nonlinearly, enabling deeper pipelines and higher clock frequencies. The move to 3D ICs will also likely be used to further shorten wirelength, but this will cause thermal issues to become a major bottleneck to performance improvement. In this paper we propose a floorplanning algorithm that takes into consideration both thermal issues and profile-weighted wirelength using mathematical programming. Our profile-driven objective improves performance by 20% over the wirelength-driven objective, while the thermal-driven objective reduces temperature by 24% on average over the profile-driven case.
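As a hedged illustration of the kind of objective such a floorplanner might optimize (the paper's exact formulation, weights, and constraints are not reproduced here), a profile-weighted wirelength term can be combined with a thermal penalty and minimized subject to non-overlap and 3D layer-assignment constraints:

    \min_{\text{floorplan}} \;\; \sum_{n \in \text{nets}} w_n \cdot \mathrm{WL}(n) \;+\; \lambda \, T_{\max}(\text{floorplan})

Here w_n is a hypothetical activity (profile) weight for net n, WL(n) is its estimated wirelength, T_max is the peak on-chip temperature estimated from the power density of the resulting floorplan, and lambda trades off the two objectives.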
GPU-based Private Information Retrieval for On-Device Machine Learning Inference
On-device machine learning (ML) inference can enable the use of private user
data on user devices without revealing them to remote servers. However, a pure
on-device solution to private ML inference is impractical for many applications
that rely on embedding tables that are too large to be stored on-device. In
particular, recommendation models typically use multiple embedding tables each
on the order of 1-10 GBs of data, making them impractical to store on-device.
To overcome this barrier, we propose the use of private information retrieval
(PIR) to efficiently and privately retrieve embeddings from servers without
sharing any private information. As off-the-shelf PIR algorithms are usually
too computationally intensive to directly use for latency-sensitive inference
tasks, we 1) propose novel GPU-based acceleration of PIR, and 2) co-design PIR
with the downstream ML application to obtain further speedup. Our GPU
acceleration strategy improves system throughput by more than over
an optimized CPU PIR implementation, and our PIR-ML co-design provides an over
additional throughput improvement at fixed model quality. Together,
for various on-device ML applications such as recommendation and language
modeling, our system on a single V100 GPU can serve up to queries per
second -- a throughput improvement over a CPU-based baseline --
while maintaining model accuracy.
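The paper accelerates cryptographic single-server PIR on GPUs; as a much simpler illustration of the retrieval guarantee only (explicitly not the paper's protocol), the sketch below shows a classic two-server XOR-based PIR over a hypothetical embedding table: neither server learns which row index was requested, yet the client recovers the row exactly.

    import numpy as np

    NUM_ROWS, DIM = 1024, 16                              # illustrative embedding-table shape
    table = np.random.randint(0, 256, size=(NUM_ROWS, DIM), dtype=np.uint8)  # replicated on both servers

    def client_queries(secret_index):
        """Split the desired row index into two random-looking bit-vector queries."""
        q1 = np.random.randint(0, 2, size=NUM_ROWS, dtype=np.uint8)
        q2 = q1.copy()
        q2[secret_index] ^= 1                             # q1 XOR q2 selects exactly one row
        return q1, q2

    def server_answer(query):
        """XOR together every row whose query bit is 1 (the server never sees the index)."""
        selected = table[query.astype(bool)]
        return np.bitwise_xor.reduce(selected, axis=0) if len(selected) else np.zeros(DIM, np.uint8)

    secret_index = 77
    q1, q2 = client_queries(secret_index)
    row = server_answer(q1) ^ server_answer(q2)           # client combines the two answers
    assert np.array_equal(row, table[secret_index])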
DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference
Neural personalized recommendation is the cornerstone of a wide collection of cloud services and products, constituting a significant fraction of the cloud infrastructure's compute demand. Improving the execution efficiency of neural recommendation therefore translates directly into infrastructure capacity savings. In this paper, we devise a novel end-to-end modeling infrastructure, DeepRecInfra, that adopts an algorithm-and-system co-design methodology to custom-design systems for recommendation use cases. Leveraging insights from this recommendation characterization, we propose a new dynamic scheduler, DeepRecSched, that maximizes latency-bounded throughput by taking into account the characteristics of inference query size and arrival patterns, recommendation model architectures, and underlying hardware systems. By doing so, system throughput is doubled across eight industry-representative recommendation models. Finally, design, deployment, and evaluation in an at-scale production datacenter show over 30% latency reduction across a wide variety of recommendation models running on hundreds of machines.
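A hedged sketch of the scheduling decision described above; the latency model, SLA target, and threshold-selection heuristic here are hypothetical stand-ins rather than DeepRecSched's actual policy. Large inference queries are split into smaller batches that run in parallel, and the split threshold is chosen as the largest one whose simulated tail latency still meets the latency target.

    import random

    SLA_MS = 100.0                                        # hypothetical tail-latency target

    def latency_ms(batch_size):
        """Hypothetical per-batch latency model (stand-in for profiled measurements)."""
        return 2.0 + 0.15 * batch_size

    def query_latency(query_size, split_threshold, parallel_lanes=4):
        """Split one query into batches of at most split_threshold items across parallel lanes."""
        batches = [split_threshold] * (query_size // split_threshold)
        if query_size % split_threshold:
            batches.append(query_size % split_threshold)
        lanes = [0.0] * parallel_lanes                    # greedy: assign to the least-loaded lane
        for b in batches:
            lanes[lanes.index(min(lanes))] += latency_ms(b)
        return max(lanes)

    def pick_threshold(query_sizes, candidates=(32, 64, 128, 256, 512)):
        """Choose the largest split threshold whose p99 latency still meets the SLA."""
        best = min(candidates)
        for t in sorted(candidates):
            lat = sorted(query_latency(q, t) for q in query_sizes)
            if lat[int(0.99 * (len(lat) - 1))] <= SLA_MS:
                best = t
        return best

    sampled = [random.randint(1, 600) for _ in range(1000)]  # hypothetical query-size sample
    print("chosen split threshold:", pick_threshold(sampled))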
Adaptive transaction scheduling for transactional memory systems
Transactional memory systems are expected to enable parallel programming at lower programming complexity, while delivering improved performance over traditional lock-based systems. Nonetheless, there are certain situations in which transactional memory systems can actually perform worse. Transactional memory systems outperform locks only when the executing workloads contain sufficient parallelism; when a workload lacks inherent parallelism, launching excessive transactions can severely degrade performance. These situations will become dominant in future workloads in which large-scale transactions are frequently executed.
In this thesis, we propose a new paradigm called adaptive transaction scheduling to address this issue. Based on parallelism feedback from applications, our adaptive transaction scheduler dynamically dispatches and controls the number of concurrently executing transactions. In our case study, we show that our low-cost mechanism not only guarantees that hardware transactional memory systems perform no worse than a single global lock, but also significantly improves performance for both hardware and software transactional memory systems.
M.S. thesis. Committee Chair: Lee, Hsien-Hsin; Committee Members: Blough, Douglas; Yalamanchili, Sudhakar.
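A minimal sketch of the adaptive idea described above, not the thesis's actual hardware or software mechanism: a toy controller throttles the number of concurrently running transactions based on recently observed abort rates, so low-parallelism phases fall back toward serialized execution. All thresholds and window sizes are hypothetical.

    class AdaptiveTxScheduler:
        """Toy controller: shrink the concurrency limit when aborts dominate, grow it otherwise."""

        def __init__(self, max_concurrency=32):
            self.limit = max_concurrency
            self.max_concurrency = max_concurrency
            self.commits = 0
            self.aborts = 0

        def report(self, committed):
            """Transactions report their outcome; the limit adapts to observed contention."""
            if committed:
                self.commits += 1
            else:
                self.aborts += 1
            window = self.commits + self.aborts
            if window >= 100:                              # adapt once per feedback window
                abort_rate = self.aborts / window
                if abort_rate > 0.5:
                    self.limit = max(1, self.limit // 2)   # heavy contention: throttle toward serial
                elif abort_rate < 0.1:
                    self.limit = min(self.max_concurrency, self.limit + 1)
                self.commits = self.aborts = 0

        def may_dispatch(self, running):
            """Dispatch a new transaction only if under the current adaptive concurrency limit."""
            return running < self.limit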
Kernel Formula Approach to the Universal Whitham Hierarchy
We derive the dispersionless Hirota equations of the universal Whitham
hierarchy from the kernel formula approach proposed by Carroll and Kodama.
Moreover, we verify the associativity equations in this hierarchy from the
dispersionless Hirota equations and give a realization of the associative
algebra with structure constants expressed in terms of residue formulas.
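For context, a standard hedged illustration (not the paper's specific derivation) of associativity equations of WDVV type for a free energy F, with structure constants given by third derivatives contracted with a nondegenerate pairing eta:

    c_{ij}^{\;k} \;=\; \sum_{l} \eta^{kl}\,
      \frac{\partial^{3} F}{\partial t_{i}\,\partial t_{j}\,\partial t_{l}},
    \qquad
    \sum_{k} c_{ij}^{\;k}\, c_{km}^{\;n}
      \;=\;
    \sum_{k} c_{jm}^{\;k}\, c_{ki}^{\;n}.

The second relation states that the algebra with multiplication e_i · e_j = sum_k c_{ij}^k e_k is associative; in the paper, analogous structure constants are realized through residue formulas.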